White Wine Quality Exploration Report by Shaomeng Chen

Introduction

This project explores a dataset on the chemical properties of white wines in order to answer the following question: “Which chemical properties influence the quality of white wines?” The dataset contains information about 4,898 white wines, with 11 variables quantifying the chemical properties of each wine. At least 3 wine experts rated the quality of each wine, providing a rating between 0 (very bad) and 10 (very excellent).

Overview of the Data Set

## 'data.frame':    4898 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
##  $ volatile.acidity    : num  0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
##  $ citric.acid         : num  0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
##  $ residual.sugar      : num  20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
##  $ chlorides           : num  0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
##  $ free.sulfur.dioxide : num  45 14 30 47 47 30 30 45 14 28 ...
##  $ total.sulfur.dioxide: num  170 132 97 186 186 97 136 170 132 129 ...
##  $ density             : num  1.001 0.994 0.995 0.996 0.996 ...
##  $ pH                  : num  3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
##  $ sulphates           : num  0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
##  $ alcohol             : num  8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
##  $ quality             : int  6 6 6 6 6 6 6 6 6 6 ...
##        X        fixed.acidity    volatile.acidity  citric.acid    
##  Min.   :   1   Min.   : 3.800   Min.   :0.0800   Min.   :0.0000  
##  1st Qu.:1225   1st Qu.: 6.300   1st Qu.:0.2100   1st Qu.:0.2700  
##  Median :2450   Median : 6.800   Median :0.2600   Median :0.3200  
##  Mean   :2450   Mean   : 6.855   Mean   :0.2782   Mean   :0.3342  
##  3rd Qu.:3674   3rd Qu.: 7.300   3rd Qu.:0.3200   3rd Qu.:0.3900  
##  Max.   :4898   Max.   :14.200   Max.   :1.1000   Max.   :1.6600  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.600   Min.   :0.00900   Min.   :  2.00     
##  1st Qu.: 1.700   1st Qu.:0.03600   1st Qu.: 23.00     
##  Median : 5.200   Median :0.04300   Median : 34.00     
##  Mean   : 6.391   Mean   :0.04577   Mean   : 35.31     
##  3rd Qu.: 9.900   3rd Qu.:0.05000   3rd Qu.: 46.00     
##  Max.   :65.800   Max.   :0.34600   Max.   :289.00     
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  9.0        Min.   :0.9871   Min.   :2.720   Min.   :0.2200  
##  1st Qu.:108.0        1st Qu.:0.9917   1st Qu.:3.090   1st Qu.:0.4100  
##  Median :134.0        Median :0.9937   Median :3.180   Median :0.4700  
##  Mean   :138.4        Mean   :0.9940   Mean   :3.188   Mean   :0.4898  
##  3rd Qu.:167.0        3rd Qu.:0.9961   3rd Qu.:3.280   3rd Qu.:0.5500  
##  Max.   :440.0        Max.   :1.0390   Max.   :3.820   Max.   :1.0800  
##     alcohol         quality     
##  Min.   : 8.00   Min.   :3.000  
##  1st Qu.: 9.50   1st Qu.:5.000  
##  Median :10.40   Median :6.000  
##  Mean   :10.51   Mean   :5.878  
##  3rd Qu.:11.40   3rd Qu.:6.000  
##  Max.   :14.20   Max.   :9.000

The data set consists of 12 variables plus an index column named X, with 4,898 observations. All of the chemical variables are of type numeric, while quality is of type int. The values of volatile acidity, citric acid, chlorides and sulphates are very small, while the values of free sulfur dioxide and total sulfur dioxide are very large. All quality ratings lie between 3 and 9.

Univariate Plots Section

To see how each variable is distributed, let’s plot a histogram for every single variable. This will also help in recognizing outliers.
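The histogram step, including the trimming of right-hand outliers used for several variables in this section, could look roughly like the sketch below. It uses base R graphics. The data frame `wine_demo`, the helper `trim_upper` and the 99th-percentile cutoff are all assumptions made for illustration; the report does not state which cutoff or plotting package was actually used.

```r
# Hypothetical stand-in data: a few fixed.acidity values plus the column
# maximum (14.2) acting as a right-hand outlier.
wine_demo <- data.frame(
  fixed.acidity = c(7, 6.3, 8.1, 7.2, 7.2, 8.1, 6.2, 7, 6.3, 8.1, 14.2)
)

# Keep only values at or below a chosen upper quantile.
# The 0.99 default is an assumed cutoff, not one taken from the report.
trim_upper <- function(x, prob = 0.99) {
  x[x <= quantile(x, prob)]
}

trimmed <- trim_upper(wine_demo$fixed.acidity)

# Histogram of the trimmed values; the original column is left untouched,
# so the outliers remain available for later analysis.
hist(trimmed, breaks = 10,
     main = "Fixed acidity (upper tail trimmed)",
     xlab = "fixed.acidity")
```

Trimming only inside the plotting step keeps every observation in the data frame itself.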

Quality

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   5.000   6.000   5.878   6.000   9.000

The distribution of quality looks roughly normal, with a mean of 5.878 and a median of 6. The peak is at 6.

Fixed Acidity

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.800   6.300   6.800   6.855   7.300  14.200

The distribution seems to be normal. The majority of fixed acidity values lie between 5 and 8, and the peak is at 6.8.

Volatile Acidity

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0800  0.2100  0.2600  0.2782  0.3200  1.1000

After removing the outliers to the right, the distribution seems to be normal with a slight right tail. The majority of volatile acidity values lie between 0.15 and 0.35, and the peak is at 0.24.

Citric Acid

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.2700  0.3200  0.3342  0.3900  1.6600

After removing the outliers to the right, we can see that the distribution seems to be normal except for a peak at 0.48.

Residual Sugar

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.600   1.700   5.200   6.391   9.900  65.800

I transformed the long-tailed data by using a log scale on the x-axis to better understand the distribution of residual sugar. The transformed distribution appears bimodal, with one peak around 1.25 and another around 8.
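A log transformation of this kind can be sketched in base R as follows. The vector `residual_sugar` is a hypothetical stand-in that only spans the reported range (0.6 to 65.8), and using `log10` on the values rather than a log-scaled axis is an assumption; both give the same picture.

```r
# Hypothetical stand-in values covering the reported range of residual.sugar.
residual_sugar <- c(0.6, 1.2, 1.5, 1.6, 1.7, 5.2, 6.9, 7, 8.5, 9.9, 20.7, 65.8)

# log10 compresses the long right tail, so structure among the small
# values is no longer squeezed into the first few bins.
log_sugar <- log10(residual_sugar)

hist(log_sugar, breaks = 8,
     main = "Residual sugar (log10 scale)",
     xlab = "log10(residual.sugar)")
```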

Chlorides

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00900 0.03600 0.04300 0.04577 0.05000 0.34600

After removing the outliers to the right, we can see that the distribution seems to be normal with a peak at 0.045.

Free Sulfur Dioxide

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2.00   23.00   34.00   35.31   46.00  289.00

After removing the outliers to the right, we can see that the distribution seems to be normal with a peak at 28, and a little right tail.

Total Sulfur Dioxide

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     9.0   108.0   134.0   138.4   167.0   440.0

After removing the outliers to the right, we can see that the distribution seems to be normal with a peak at 120.

Density

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9871  0.9917  0.9937  0.9940  0.9961  1.0390

After removing the outliers to the right, we can see that the distribution seems to be normal with a peak at 0.993.

pH

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.720   3.090   3.180   3.188   3.280   3.820

We can see that the distribution seems to be normal with a peak at 3.14.

Sulphates

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2200  0.4100  0.4700  0.4898  0.5500  1.0800

We can see that the distribution seems to be normal with a peak at 0.46, and a little right tail.

Alcohol

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.00    9.50   10.40   10.51   11.40   14.20

I transformed the long-tailed data by using a log scale on the x-axis to better understand the distribution of alcohol. The transformed distribution has a peak around 10.

New Variables

Since the data set contains both fixed acidity and volatile acidity, and both free sulfur dioxide and total sulfur dioxide, I create the following 4 new variables, which may be helpful in finding the factors that influence quality.

  • total acidity, computed as fixed acidity + volatile acidity;
  • fixed acidity ratio, computed as fixed acidity / total acidity;
  • non_free sulfur dioxide, computed as total sulfur dioxide - free sulfur dioxide;
  • free sulfur dioxide ratio, computed as free sulfur dioxide / total sulfur dioxide.
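The four definitions above translate directly into column arithmetic. The sketch below uses the first three rows shown in the data overview; the data frame name `wine_demo` and the new column names are assumptions, since the report does not show the names it actually used.

```r
# First three rows of the relevant columns, copied from the overview above.
wine_demo <- data.frame(
  fixed.acidity        = c(7, 6.3, 8.1),
  volatile.acidity     = c(0.27, 0.3, 0.28),
  free.sulfur.dioxide  = c(45, 14, 30),
  total.sulfur.dioxide = c(170, 132, 97)
)

# New column names below are hypothetical.
wine_demo$total.acidity <-
  wine_demo$fixed.acidity + wine_demo$volatile.acidity
wine_demo$fixed.acidity.ratio <-
  wine_demo$fixed.acidity / wine_demo$total.acidity
wine_demo$non.free.sulfur.dioxide <-
  wine_demo$total.sulfur.dioxide - wine_demo$free.sulfur.dioxide
wine_demo$free.sulfur.dioxide.ratio <-
  wine_demo$free.sulfur.dioxide / wine_demo$total.sulfur.dioxide

print(wine_demo)
```

Because each derived column is a function of two original columns, it is expected to correlate strongly with them, which is why later sections set the derived variables aside when looking for the strongest pairwise relationships.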

Total Acidity

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   4.110   6.570   7.070   7.133   7.590  14.470

After removing the outliers to the right, we can see that the distribution seems to be normal with a peak at 7.

Fixed Acidity Ratio

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.8472  0.9538  0.9631  0.9606  0.9706  0.9890

We can see that the distribution is left-skewed with a peak at 0.965.

Non Free Sulfur Dioxide

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     4.0    78.0   100.0   103.1   125.0   331.0

After removing the outliers to the right, we can see that the distribution seems to be normal with a peak at 85 and a slight right tail.

Free Sulfur Dioxide Ratio

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.02362 0.19090 0.25370 0.25560 0.31580 0.71050

After removing the outliers to the right, we can see that the distribution seems to be normal with a peak at 0.27.

Univariate Analysis

What is the structure of your dataset?

There are 4,898 white wines in the dataset with 12 features, plus an index column ‘X’. The variable quality is of type int, while the others are of type numeric.

What is/are the main feature(s) of interest in your dataset?

The main feature of interest in the data set is quality. I’m interested in finding out which features influence quality and how to predict quality from these features.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

I think all these features will influence the quality of white wine, especially alcohol.

Did you create any new variables from existing variables in the dataset?

Yes, I created 4 new variables: two derived from fixed acidity and volatile acidity, and two derived from free sulfur dioxide and total sulfur dioxide.

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

There are some outliers in those distributions. In order to observe the distributions more accurately, I removed these outliers when plotting the histograms. But I think these outliers are not incorrect values, so I will keep them in the data.

Bivariate Plots Section

Now, let’s explore the correlations between variables.

We can see that quality has relatively strong correlations with alcohol and density. The correlation coefficients are 0.44 and -0.31, respectively. In addition, the absolute values of the correlation coefficients between quality and chlorides, non_free sulfur dioxide and free sulfur dioxide ratio are all above 0.2. So we will explore the relationship between quality and each of these 5 variables in more depth. Leaving aside the 4 variables derived from existing variables, residual sugar and density have the strongest relationship, with a correlation coefficient of 0.84. So we will explore the relationship between these two variables further as well.
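Pairwise correlation coefficients like these can be obtained with `cor()`, which computes Pearson correlations by default. The sketch below runs on a small hypothetical data frame, so the numbers it prints are illustrative only and are not the coefficients reported in this section.

```r
# Hypothetical stand-in data; values are made up for illustration.
wine_demo <- data.frame(
  alcohol = c(8.8, 9.5, 10.1, 11.0, 12.2, 12.8),
  density = c(1.0010, 0.9960, 0.9950, 0.9938, 0.9910, 0.9900),
  quality = c(5, 5, 6, 6, 7, 8)
)

# Full Pearson correlation matrix for all numeric columns.
cor_matrix <- cor(wine_demo)
print(round(cor_matrix, 2))

# Correlations of every other variable with quality,
# ranked by absolute value so the sign does not affect the order.
quality_cor <- cor_matrix["quality", setdiff(colnames(cor_matrix), "quality")]
print(quality_cor[order(abs(quality_cor), decreasing = TRUE)])
```

Ranking by absolute value matters here because a strong negative coefficient is as informative as a strong positive one.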

Quality vs Alcohol

To make it easier to plot the relationships between quality and other variables, I will create a new variable named quality_f by converting quality into an ordered factor.
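Converting quality into an ordered factor can be sketched as below. The name `quality_f` comes from the text; the data frame `wine_demo`, its values and the choice of levels 3 to 9 (the observed range of quality) are assumptions.

```r
# Hypothetical stand-in data.
wine_demo <- data.frame(
  quality = c(6L, 5L, 7L, 3L, 9L, 6L),
  alcohol = c(9.5, 9.1, 11.4, 8.8, 12.5, 10.1)
)

# An ordered factor keeps the ranking 3 < 4 < ... < 9, so plots arrange
# the groups in rating order instead of treating quality as a continuous number.
wine_demo$quality_f <- factor(wine_demo$quality, levels = 3:9, ordered = TRUE)

# One box per quality level; levels with no observations are left empty.
boxplot(alcohol ~ quality_f, data = wine_demo,
        xlab = "quality", ylab = "alcohol")
```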

We can see the tendency that wines with higher alcohol percentage have higher quality.

The boxplot leads to the same conclusion as the scatterplot when quality is greater than 5.

Quality vs Density

We can see the tendency that wines with lower density have higher quality.

The boxplot leads to the same conclusion as the scatterplot when quality is greater than 5. The median densities of wines with quality between 3 and 5 do not differ much.

Quality vs Chlorides

We can see the tendency that wines with lower chlorides have higher quality.

The boxplot leads to the same conclusion as the scatterplot when quality is greater than 5. The median chlorides of wines with quality between 3 and 5 do not differ much.

Quality vs Non_free Sulfur Dioxide

We can see the tendency that wines with lower non_free sulfur dioxide have higher quality.

The boxplot leads to the same conclusion as the scatterplot when quality is greater than 5.

Quality vs Free Sulfur Dioxide Ratio

We can see the tendency that wines with higher free sulfur dioxide ratio have higher quality.

The boxplot leads to the same conclusion as the scatterplot when quality is greater than 4.

Residual Sugar vs Density

Clearly, higher residual sugar goes along with higher density.

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

After looking at the correlations between every pair of variables, we can see that alcohol has the strongest relationship with quality. In addition, density, chlorides, non_free sulfur dioxide and free sulfur dioxide ratio all have correlation coefficients with quality above 0.2 in absolute value, which I think makes these relationships worthy of attention. These 5 features are the ones most strongly associated with the quality of white wine.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

I also observed the relationship between residual sugar and density. We can see that higher residual sugar goes along with higher density.

What was the strongest relationship you found?

Leaving aside the 4 new variables derived from existing variables, residual sugar and density have the strongest relationship, with a correlation coefficient of 0.84. The plot clearly shows that higher residual sugar goes along with higher density.

Multivariate Plots Section

Leaving aside the 4 new variables, the absolute values of the correlation coefficients for residual sugar and density, alcohol and density, free sulfur dioxide and total sulfur dioxide, and density and total sulfur dioxide are all above 0.5, which I take to mean strong relationships. I’m curious about how these pairs relate to quality. Now I’ll explore the relationship between each of those 4 pairs and quality with scatterplots, as follows:

We can see a strong positive correlation between density and residual sugar. But quality only seems to be related to density, not to residual sugar. On the whole, I can say that the combination of density and residual sugar has no obvious relationship with quality.

Here we can see a strong negative correlation between density and alcohol. In addition, wines with high quality tend to be in the bottom right corner of the plot, and wines with low quality tend to be in the top left corner. There is a tendency that high alcohol percentage and low density may lead to high quality. That means the combination of density and alcohol has an obvious relationship with quality.
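A scatterplot of two variables coloured by quality can be sketched in base R as follows. The data frame `wine_demo`, its values, the axis assignment (alcohol on the x-axis, density on the y-axis) and the colour palette are all assumptions; the report does not show how its plots were produced.

```r
# Hypothetical stand-in data; values are made up for illustration.
wine_demo <- data.frame(
  alcohol = c(8.8, 9.5, 10.1, 11.0, 12.2, 12.8),
  density = c(1.0010, 0.9960, 0.9950, 0.9938, 0.9910, 0.9900),
  quality = c(5, 5, 6, 6, 7, 8)
)

# One colour per quality level (3 to 9), taken from a sequential palette
# so that neighbouring ratings get neighbouring colours.
quality_f <- factor(wine_demo$quality, levels = 3:9, ordered = TRUE)
palette_q <- hcl.colors(nlevels(quality_f), palette = "viridis")
point_col <- palette_q[as.integer(quality_f)]

plot(wine_demo$alcohol, wine_demo$density,
     col = point_col, pch = 19,
     xlab = "alcohol", ylab = "density")
legend("topright", legend = levels(quality_f),
       col = palette_q, pch = 19, title = "quality")
```

With this axis assignment, points in the bottom right corner are wines with high alcohol and low density.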

We can see a positive correlation between free sulfur dioxide and total sulfur dioxide. But quality seems to have little relationship with free sulfur dioxide and total sulfur dioxide. On the whole, I can say the combination of free sulfur dioxide and total sulfur dioxide has no meaningful relationship with quality.

We can’t see any obvious correlation between density and total sulfur dioxide in the plot. Quality seems to have little relationship with density and total sulfur dioxide taken together. On the whole, I can say the combination of density and total sulfur dioxide has no meaningful relationship with quality.

Multivariate Analysis

I find that wines with high alcohol percentage and low density are likely to get high quality ratings.

Alcohol, density, chlorides, non_free sulfur dioxide and free sulfur dioxide ratio all have correlation coefficients with quality above 0.2 in absolute value, so I think they are the main features that influence the quality of white wine. I will therefore use these 5 variables to create models for predicting quality. In addition, I will also create models that use all the variables, and compare the two.

Since there are several classical methods for classification, I will try decision tree, SVM, naive Bayes and random forest models and compare their prediction accuracy. I divide the data randomly into 2 parts: 70% of the data is used to train the models, while the other 30% is used to test them.
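The random 70/30 split and the accuracy measure can be sketched in base R as below. To stay free of extra packages, the sketch does not fit any of the four models named above; it uses a majority-class predictor purely as a stand-in so that the accuracy computation has something to score. The data frame `wine_demo`, the seed and the helper `accuracy` are hypothetical.

```r
set.seed(42)  # assumed seed, used only to make the sketch reproducible

# Hypothetical stand-in data; the real data set has 4,898 rows.
n <- 200
wine_demo <- data.frame(
  alcohol = runif(n, 8, 14.2),
  quality = factor(sample(3:9, n, replace = TRUE), levels = 3:9)
)

# Random 70/30 split by row index.
train_idx <- sample(seq_len(n), size = floor(0.7 * n))
train <- wine_demo[train_idx, ]
test  <- wine_demo[-train_idx, ]

# Accuracy: share of test rows whose predicted label equals the true label.
accuracy <- function(predicted, actual) {
  mean(as.character(predicted) == as.character(actual))
}

# Stand-in "model": always predict the most frequent label in the training set.
majority <- names(which.max(table(train$quality)))
baseline_pred <- rep(majority, nrow(test))

cat("baseline accuracy:", accuracy(baseline_pred, test$quality), "\n")
```

Any real classifier would replace `baseline_pred` with its own predictions on `test`, and the same `accuracy` call would then score it on rows it never saw during training.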

Here is the result:

## the accuracy of decision tree model with selected features is: 0.4996596
## the accuracy of decision tree model with all features is: 0.5105514
## the accuracy of SVM model with selected features is: 0.5275698
## the accuracy of SVM model with all features is: 0.5786249
## the accuracy of naive Bayse model with selected features is: 0.4765146
## the accuracy of naive Bayse model with all features is: 0.4431586
## the accuracy of random forest model with selected features is: 0.6541865
## the accuracy of random forest model with all features is: 0.6773315

The random forest model with all features has the highest accuracy of all, about 0.68. Since there are 7 quality levels (3 to 9) to choose from in this data set, I think an accuracy of 0.68 is not a bad result. So, in this project, random forest may be the best method for prediction.

Final Plots and Summary

Plot One

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   5.000   6.000   5.878   6.000   9.000

Description One

All quality ratings are between 3 and 9, and they are roughly normally distributed. A quality of 6 has the highest frequency, while a quality of 9 has the lowest. Of the 4,898 observations, over 40% have a quality of 6, and nearly one third have a quality of 5.

Plot Two

Description Two

We can see a strong positive correlation between the amount of residual sugar and the density of the wine. There is a clear tendency for higher residual sugar to go along with higher density.

Plot Three

Description Three

We can see that wines with high quality tend to have high alcohol percentage and low density, while wines with low quality tend to have low alcohol percentage and high density. We can conclude that high alcohol percentage and low density may lead to high quality.

Reflection

The white wine dataset contains 4,898 observations with 11 feature variables and one label variable (quality). I aimed to find out which chemical properties affect wine quality, and to create a simple model for predicting quality from the variables identified.

I successfully found a relationship between quality and the combination of alcohol and density. But I could hardly find any relationships between quality and other combinations. I think some appropriate transformations of the variables might be needed to reveal obvious relationships.

For further exploration, I think variables such as vintage, storage method and raw material could be included, which in my opinion may have a great effect on quality.